78        Bioinformatics

mkdir indexdir

STAR --runThreadN 4 \
    --runMode genomeGenerate \
    --genomeDir indexdir \
    --genomeFastaFiles ucscref/hg38.fa \
    --sjdbGTFfile ucscref/hg38.ncbiRefSeq.gtf \
    --sjdbOverhang 100

Above, with the “STAR” command, we used “--runThreadN” to specify the number of
threads used for indexing, “--runMode genomeGenerate” to tell the command that
we wish to generate a genome index, “--genomeDir” to specify the directory where the
index files are to be saved, “--genomeFastaFiles” to specify the file path of the reference
genome FASTA file, “--sjdbGTFfile” to specify the file path of the annotation GTF file, and
“--sjdbOverhang” to specify the length of the genomic sequence around each annotated
junction to be used in constructing the splice junctions database. For this option, we
provide the read length minus one (n-1) if all reads have the same length; otherwise, we
provide the maximum read length minus one.
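As a rough sketch of how the “--sjdbOverhang” value could be derived from the data, the helper below scans a gzipped FASTQ file and prints the maximum read length minus one. The function name max_overhang is hypothetical, and the sketch assumes standard four-line FASTQ records (sequence on every second line of each record, no line wrapping).

```shell
# Hypothetical helper: print (maximum read length - 1) for a gzipped FASTQ file.
# Assumes four-line FASTQ records, so the sequence is line 2 of every record.
max_overhang() {
    gzip -dc "$1" |
        awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
             END { print max - 1 }'
}

# Example usage (file name taken from the alignment command in this section):
# max_overhang data/SRR769545_1.fastq.gz
```

For paired-end data, the helper can be run on both files and the larger of the two values used.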

The process of indexing may take a long time and may consume more memory and storage
space than the other aligners. Several files will be generated, including binary
genome sequence files, suffix array files, text files for the chromosome names and
lengths, splice junction coordinates, and transcript/gene information. Those files are
for STAR's internal use; however, the chromosome names can be renamed in the
chromosome file if needed.
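Before moving on to alignment, it can be useful to confirm that indexing finished and left a complete index behind. The sketch below checks for a few of the files that recent STAR versions are known to write into the index directory; the function name check_star_index is hypothetical, and the exact set of files is an assumption that may differ between STAR versions.

```shell
# Hypothetical helper: verify that key STAR index files exist and are non-empty.
# The file list is an assumption based on recent STAR releases.
check_star_index() {
    dir=$1
    missing=0
    for f in Genome SA SAindex chrName.txt chrLength.txt genomeParameters.txt; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage with the index directory created above:
# check_star_index indexdir && echo "index looks complete"
```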

The next step is to use the STAR command to align the reads. This time we will use
“--runMode alignReads” to tell the program to run in read mapping mode, “--outSAMtype
BAM Unsorted” to generate an unsorted BAM file, “--readFilesCommand zcat” to tell the
program that the FASTQ files are compressed, “--genomeDir” to specify the index
directory, “--outFileNamePrefix” to specify the prefix for the output files, and
“--readFilesIn” to specify the FASTQ file names. You can also set “--outSAMtype BAM
SortedByCoordinate” to generate a BAM file sorted by the alignment coordinates. However,
sorting within STAR needs considerably more memory and may exhaust the RAM of a
computer with 32 GB.

STAR --runThreadN 4 \
    --runMode alignReads \
    --outSAMtype BAM Unsorted \
    --readFilesCommand zcat \
    --genomeDir indexdir \
    --outFileNamePrefix STARoutput/SRR769545 \
    --readFilesIn data/SRR769545_1.fastq.gz data/SRR769545_2.fastq.gz
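If a coordinate-sorted BAM file is needed but sorting inside STAR is too memory-hungry, one possible workaround is to sort the unsorted output afterwards with samtools, which lets you cap the memory used per thread. This is only a sketch: it assumes samtools is installed, that STAR wrote the unsorted BAM under its usual name (the prefix followed by “Aligned.out.bam”), and the sorted output name is a hypothetical choice.

```shell
prefix=STARoutput/SRR769545                 # same value as --outFileNamePrefix
unsorted_bam="${prefix}Aligned.out.bam"     # name STAR gives the unsorted BAM
sorted_bam="${prefix}Aligned.sorted.bam"    # hypothetical output name

# Sort and index only if samtools is available and the input exists.
if command -v samtools >/dev/null 2>&1 && [ -f "$unsorted_bam" ]; then
    # -@ sets the number of threads, -m the memory limit per thread
    samtools sort -@ 4 -m 1G -o "$sorted_bam" "$unsorted_bam"
    samtools index "$sorted_bam"
fi
```

With four threads and 1 GB per thread, the sort stays within a few gigabytes of RAM, spilling to temporary files on disk when needed.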

STAR alignment mode produces a BAM file, containing read alignment information,
and four text files: three log files named “*Log.out”, “*Log.progress.out”, and
“*Log.final.out”, where “*” stands for the prefix specified by the “--outFileNamePrefix” option,